Skip to content

feat(runtime)!: persist response chains on neon postgresql#211

Open
xirothedev wants to merge 7 commits intoSoju06:mainfrom
xirothedev:feat/previous-response-and-neon-postgres
Open

feat(runtime)!: persist response chains on neon postgresql#211
xirothedev wants to merge 7 commits intoSoju06:mainfrom
xirothedev:feat/previous-response-and-neon-postgres

Conversation

@xirothedev
Copy link
Copy Markdown

Closes #210

Summary

  • persist durable response snapshots keyed by response_id and replay previous_response_id across collected HTTP responses, streaming HTTP responses, and WebSocket Responses traffic
  • require Neon/PostgreSQL runtime configuration with a dedicated CODEX_LB_DATABASE_MIGRATION_URL, and remove SQLite-first runtime assumptions from the main execution path
  • normalize timezone-aware API key expiration datetimes to UTC-naive values before persisting them to PostgreSQL

Root Cause

previous_response_id continuity gap

The proxy could not delegate previous_response_id to ChatGPT upstream, but it also did not persist enough local state to rebuild prior turns. That left /v1/responses incompatible with clients that continue a conversation by referencing the previous response id.

Runtime database contract drift

The repo still defaulted to SQLite-oriented runtime behavior in config, startup, and docs even though the intended deployment model is Neon/PostgreSQL. That mismatch made the runtime contract and operator guidance diverge.

Timezone-aware API key expiration failures

expiresAt values from the dashboard include timezone information, but the backend wrote them into a timestamp without time zone column without first normalizing them, which caused PostgreSQL writes to fail.

Changes

Area Details
Responses compatibility Add durable response snapshots, recursive previous_response_id replay, shared output-item accumulation, and preferred-account reuse/fallback for /v1/responses over HTTP and WebSocket
Database runtime Require PostgreSQL/Neon runtime URLs, add dedicated migration DSN handling, normalize asyncpg SSL params for Docker deployments, and remove SQLite runtime fallback assumptions
API keys Normalize timezone-aware expiration datetimes before create/update persistence so ISO 8601 values with offsets remain accepted
Specs and docs Update OpenSpec change artifacts and main specs for responses compatibility, database backend/migration behavior, and API key expiration handling
Regression coverage Extend integration and unit coverage around snapshot persistence, migration/runtime configuration, routing fallback, websocket continuity, and API key expiration normalization

Testing

  • .venv/bin/python -m compileall app tests
  • ⚠️ .venv/bin/pytest tests/unit/test_api_keys_service.py tests/unit/test_db_session.py tests/unit/test_openai_requests.py tests/unit/test_proxy_utils.py tests/integration/test_migrations.py tests/integration/test_openai_compat_features.py tests/integration/test_proxy_websocket_responses.py tests/integration/test_load_balancer_integration.py could not run in this environment because CODEX_LB_TEST_DATABASE_URL is not set
  • ⚠️ openspec validate --specs could not run in this environment because the openspec CLI is not installed

Add durable response snapshots and replay previous_response_id across HTTP and websocket responses. Prefer the original serving account when it is still eligible, add migration coverage, and sync the OpenSpec change and main specs.
Require PostgreSQL runtime configuration, add a dedicated migration DSN, and remove SQLite-specific startup behavior from the main runtime path.

Normalize asyncpg SSL query parameters for Docker deployments and update OpenSpec, docs, compose, and tests for the Neon-first database contract.
@xirothedev xirothedev requested a review from Soju06 as a code owner March 15, 2026 15:39
Copilot AI review requested due to automatic review settings March 15, 2026 15:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements durable previous_response_id replay for /v1/responses (HTTP + WebSocket) by persisting response snapshots, adds routing continuity by preferring the prior account when eligible, and shifts runtime/test expectations to Neon PostgreSQL with a dedicated migration DSN. Also fixes API key expiration persistence by normalizing timezone-aware datetimes to UTC-naive.

Changes:

  • Persist and resolve durable response snapshots to rebuild prior input/output history for previous_response_id, with HTTP/WebSocket parity and integration coverage.
  • Prefer the prior serving account for replayed requests (with fallback), and add regression coverage for the new selection behavior.
  • Require explicit PostgreSQL configuration (runtime + tests), introduce migration DSN resolution, and normalize asyncpg SSL URL query params; plus API key expiry timezone normalization.

Reviewed changes

Copilot reviewed 46 out of 47 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
uv.lock Bumps local package version in lockfile.
app/modules/proxy/service.py Adds snapshot persistence, previous_response_id resolution, replay input reconstruction, and preferred-account routing for Responses (HTTP + WS).
app/modules/proxy/response_snapshots_repository.py New repository for response snapshot get/upsert.
app/modules/proxy/request_policy.py Adds helper to build OpenAI invalid_request_error envelopes with param.
app/modules/proxy/repo_bundle.py Extends proxy repo bundle to include response_snapshots.
app/modules/proxy/load_balancer.py Adds optional preferred_account_id to bias account selection when eligible.
app/modules/api_keys/service.py Normalizes timezone-aware expires_at values to UTC-naive before persistence.
app/dependencies.py Wires ResponseSnapshotsRepository into proxy dependencies.
app/db/models.py Adds ResponseSnapshot ORM model + index.
app/db/alembic/versions/20260315_120000_add_response_snapshots.py Adds migration creating response_snapshots table + index.
app/db/alembic/env.py Alembic now uses migration DSN when configured; removes sqlite batch mode toggles.
app/db/migrate.py CLI defaults to migration DSN (or runtime DSN) when --db-url not supplied.
app/db/session.py Removes SQLite startup logic; adds asyncpg URL normalization and runs startup migrations against migration DSN.
app/core/openai/requests.py Stops rejecting previous_response_id; enforces mutual exclusivity with conversation.
app/core/config/settings.py Requires CODEX_LB_DATABASE_URL; adds CODEX_LB_DATABASE_MIGRATION_URL with fallback to runtime DSN.
tests/unit/test_proxy_utils.py Adds unit coverage for rebuilding upstream input from stored snapshots in WS prepare flow.
tests/unit/test_openai_requests.py Updates validation tests for previous_response_id acceptance + conflict with conversation.
tests/unit/test_db_session.py Updates session/init_db tests for required DB URL, migration DSN behavior, and asyncpg URL normalization.
tests/unit/test_api_keys_service.py Adds regression tests for timezone-aware expires_at normalization on create/update.
tests/integration/test_openai_compat_features.py Adds integration coverage for unknown previous_response_id and replay after restart for HTTP.
tests/integration/test_proxy_websocket_responses.py Adds integration coverage for replay after restart for WebSocket Responses.
tests/integration/test_migrations.py Verifies migrations create the response_snapshots table/columns.
tests/integration/test_load_balancer_integration.py Adds integration coverage for preferred-account selection + fallback.
tests/conftest.py Requires explicit PostgreSQL test DSN env vars; switches temp key file to tmp_path.
README.md Updates runtime instructions/docs to require Neon PostgreSQL and separate migration DSN.
.env.example Updates env template to Neon PostgreSQL runtime + migration DSNs (and sets encryption key file).
docker-compose.yml Removes local Postgres service/volume.
openspec/specs/responses-api-compat/spec.md Updates spec to define durable previous_response_id replay + routing continuity requirements.
openspec/specs/responses-api-compat/context.md Updates context to reflect proxy-local previous_response_id support and error semantics.
openspec/specs/api-keys/spec.md Updates spec to require timezone-aware expiresAt acceptance + normalization.
openspec/specs/database-migrations/spec.md Updates migration spec to use resolved migration DSN.
openspec/specs/database-migrations/context.md Updates migration DSN resolution and removes sqlite backup notes.
openspec/specs/database-backends/spec.md Updates backend spec to require Neon PostgreSQL and explicit test DSNs.
openspec/specs/database-backends/context.md Updates backend context for Neon pooled vs direct DSNs.
openspec/specs/query-caching/context.md Updates context wording for PostgreSQL-backed runtime.
openspec/changes/support-previous-response-id-persistence/* Adds proposal/design/tasks + modified specs for durable previous_response_id support.
openspec/changes/neon-postgresql-runtime/* Adds proposal/tasks + modified specs for Neon-first runtime behavior.
openspec/changes/fix-api-key-expiration-timezone/* Adds proposal/tasks + modified spec for expiry timezone normalization.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 972251abce

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 72cd7e8885

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1867 to +1870
resolved_request = await self._resolve_previous_response_request(
payload,
api_key_id=api_key.id if api_key else None,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Move previous_response lookup inside settlement guard

_resolve_previous_response_request() now runs before the try/finally that releases stream API-key reservations, so any non-ProxyResponseError raised while loading snapshots (for example a transient DB failure during repos.response_snapshots.get) exits _stream_with_retry() without releasing api_key_reservation. Because reservations are created before streaming starts, repeated failures on requests that use previous_response_id can accumulate leaked reservations and block that key from further traffic.

Useful? React with 👍 / 👎.

@xirothedev
Copy link
Copy Markdown
Author

Follow-up fix for the first-turn WebSocket transport fallback issue.

This patch updates the non-terminal upstream disconnect path to:

  • log the disconnect context before terminal events
  • penalize the failed account with stream_incomplete
  • retry exactly once on a different account before surfacing failure downstream

Regression coverage added for:

  • retrying once after upstream EOF before response.created
  • preserving the existing response.failed behavior when retry is not possible
  • keeping previous_response_id integration checks green

Verification run locally:

  • .venv/bin/python -m pytest tests/integration/test_proxy_websocket_responses.py
  • `.venv/bin/python -m pytest tests/integration/test_openai_compat_features.py -k 'previous_response_id'

@xirothedev
Copy link
Copy Markdown
Author

Added an image-based VPS deploy path in this branch: manual branch image build workflow () and for pull-based deploys. Triggering the first image build now to avoid VPS OOM during local Docker builds.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a80159f1d9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1281 to +1282
await self._handle_stream_error(account, {"message": disconnect_message}, "stream_incomplete")
retried = await self._retry_websocket_request_after_disconnect(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid penalizing account on idle websocket disconnects

This disconnect path now records stream_incomplete against the current account before checking whether any requests are actually in flight. When the upstream websocket closes during an idle period (for example, normal upstream idle timeout between response.create calls), _handle_stream_error still runs and record_error marks a healthy account as failing, which can skew later account selection and cause avoidable failover. The penalty should only be applied when a pending request was interrupted (or after retry failure for such a request).

Useful? React with 👍 / 👎.

@Soju06
Copy link
Copy Markdown
Owner

Soju06 commented Mar 17, 2026

I don't understand the clear reason for removing sqlite
I use sqlite, and it is not a fallback. Postgres is optional and sqlite is the default
However, the session cache hit improvement looks good. Please split the PR and submit it separately

@tiagoefreitas
Copy link
Copy Markdown

does this fix the invalid_request_error errors in codex 106 mentioned in #221 ?

@yigitkonur
Copy link
Copy Markdown
Contributor

thanks, this makes sense.

i went through #211 again against current main and i agree sqlite should stay the default and postgres should stay optional.

also, the cache/continuity side of this branch is no longer best carried by #211 as-is. main already moved forward with the bridge-based path in #220, #232, #236, and #239, so i’m not going to push the neon/sqlite runtime rewrite or the snapshot persistence approach from this branch.

i split out the remaining unmerged piece that still looked useful: timezone-aware api key expiration normalization. that pr is here: #249

so from my side, #211 should be treated as superseded rather than trimmed down in place.

@yigitkonur
Copy link
Copy Markdown
Contributor

Following up on the split request here: I opened #270 as the standalone SQLite-first version of the useful continuity/cache work from this branch.

What #270 does:

  • keeps SQLite as the default backend and leaves PostgreSQL optional
  • does not include the PostgreSQL/Neon runtime rewrite from feat(runtime)!: persist response chains on neon postgresql #211
  • restores durable previous_response_id continuity on current main
  • includes the caller/API-key scoping fix, preferred-account reuse, and the early websocket disconnect retry

So if the goal is to keep the session-cache/continuity improvement without changing the repo's DB default, please review #270 instead of trimming #211 in place.

Full credit to @xirothedev for the original continuity implementation and follow-up fixes in #211. #270 just isolates the SQLite-compatible subset onto current main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Persist previous_response_id chains, require neon postgresql runtime, and normalize API key expirations

5 participants